Background Story
Gradient descent is not efficient in variational inference because probability distributions do not naturally live in Euclidean space but rather on a statistical manifold. There are better ways of defining the distance between distributions; one of the simplest is the symmetrized Kullback-Leibler divergence:

$$\mathrm{KL}_{\mathrm{sym}}(p_1, p_2) = \tfrac{1}{2}\bigl(\mathrm{KL}(p_1 \,\|\, p_2) + \mathrm{KL}(p_2 \,\|\, p_1)\bigr)$$
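A minimal numerical sketch, assuming two univariate Gaussians so that both directed KL terms have the standard closed form:

```python
import numpy as np

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for univariate Gaussians."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2.0 * sigma2**2)
            - 0.5)

def kl_sym(mu1, sigma1, mu2, sigma2):
    """Symmetrized KL: average of the two directed divergences."""
    return 0.5 * (kl_gauss(mu1, sigma1, mu2, sigma2)
                  + kl_gauss(mu2, sigma2, mu1, sigma1))

print(kl_sym(0.0, 1.0, 1.0, 2.0))
print(kl_sym(1.0, 2.0, 0.0, 1.0))   # same value: the symmetrized divergence is symmetric
```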
In differential geometry, the squared length of an infinitesimal displacement $d\phi$ on a manifold is given by the bilinear form

$$\|d\phi\|^2 = \langle d\phi,\, G(\phi)\, d\phi \rangle = \sum_{i,j} g_{ij}(\phi)\, d\phi_i\, d\phi_j.$$

The matrix $G(\phi) = [g_{ij}(\phi)]$ is called the Riemannian metric tensor.
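As a small worked case: for a two-parameter family with a diagonal metric $G(\phi) = \mathrm{diag}\bigl(g_{11}(\phi),\, g_{22}(\phi)\bigr)$, the squared length of a step $d\phi = (d\phi_1, d\phi_2)$ is

$$\|d\phi\|^2 = g_{11}(\phi)\, d\phi_1^2 + g_{22}(\phi)\, d\phi_2^2,$$

so the same coordinate displacement can have very different lengths at different points $\phi$.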
In Euclidean space with an orthonormal basis, $G(\phi)$ is simply the identity matrix. When $\Phi$ is a space of parameters of probability distributions and the symmetrized KL divergence is used to measure the distance between distributions, $G(\phi)$ turns out to be the Fisher information matrix:

$$G(\phi)_{i,j} = \mathbb{E}\!\left[\left(\frac{\partial}{\partial \phi_i} \log f(X;\phi)\right)\left(\frac{\partial}{\partial \phi_j} \log f(X;\phi)\right) \,\middle|\, \phi \right].$$
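For a concrete check, consider a univariate Gaussian $f(x;\phi) = \mathcal{N}(x;\mu,\sigma^2)$ with $\phi = (\mu, \sigma)$, whose Fisher information matrix is known in closed form to be $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$. A minimal NumPy sketch that estimates the expectation above by Monte Carlo over the score outer products and compares it to the closed form:

```python
import numpy as np

def score(x, mu, sigma):
    """Score of N(mu, sigma^2): gradient of log f(x; mu, sigma) w.r.t. (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma
    return np.stack([d_mu, d_sigma], axis=-1)

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
x = rng.normal(mu, sigma, size=200_000)

s = score(x, mu, sigma)            # shape (N, 2): per-sample score vectors
fisher_mc = s.T @ s / len(x)       # Monte Carlo estimate of E[score score^T]
fisher_exact = np.diag([1.0 / sigma**2, 2.0 / sigma**2])

print(np.round(fisher_mc, 3))
print(np.round(fisher_exact, 3))
```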
The Story
In gradient ascent (of the evidence lower bound in variational inference), we want to maximize

$$\mathcal{L}(\phi + \epsilon v) \approx \mathcal{L}(\phi) + \epsilon\, \nabla \mathcal{L}(\phi)^{T} v$$

subject to the constraint

$$\|v\|^2 = \langle v,\, G(\phi)\, v \rangle = 1.$$

Solving with Lagrange multipliers shows that the optimal direction $v$ is proportional to the inverse of the Fisher information matrix times the ordinary gradient, which is the natural gradient:

$$G(\phi)^{-1}\, \nabla \mathcal{L}(\phi)$$
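As a minimal sketch of the resulting update rule $\phi \leftarrow \phi + \epsilon\, G(\phi)^{-1} \nabla \mathcal{L}(\phi)$, assuming a univariate Gaussian likelihood with $\phi = (\mu, \sigma)$ and its closed-form Fisher matrix $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$ as the metric, natural-gradient ascent on the average log-likelihood of a data set looks like:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 0.5, size=1_000)   # observed data
mu, sigma = 0.0, 2.0                   # initial parameters phi = (mu, sigma)
eps = 0.1                              # step size

for _ in range(200):
    # Ordinary gradient of the average log-likelihood w.r.t. (mu, sigma).
    grad = np.array([
        np.mean(x - mu) / sigma**2,
        np.mean((x - mu)**2) / sigma**3 - 1.0 / sigma,
    ])
    # Fisher information matrix of N(mu, sigma^2) in (mu, sigma) coordinates.
    G = np.diag([1.0 / sigma**2, 2.0 / sigma**2])
    # Natural gradient: G(phi)^{-1} grad L(phi).
    nat_grad = np.linalg.solve(G, grad)
    mu = mu + eps * nat_grad[0]
    sigma = sigma + eps * nat_grad[1]

print(mu, sigma)            # approaches the sample mean and (ML) standard deviation
print(x.mean(), x.std())
```

Note that $G(\phi)^{-1}$ rescales the $\mu$-component of the step to $\epsilon(\bar{x} - \mu)$, independent of the current $\sigma$; that insensitivity to how the parameters are scaled is the practical payoff of the natural gradient.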
Reference
The Natural Gradient: https://hips.seas.harvard.edu/blog/2013/01/25/the-natural-gradient/